Midterm

Data Visualization (STAT 302)

Author

Shuo Han

Overview

The midterm attempts to bring together everything you have learned to date. You’ll be asked to replicate a series of graphics to demonstrate your skills and provide short descriptions/explanations regarding issues and concepts in ggplot2.

You are free to use any resource at your disposal such as notes, past labs, the internet, fellow students, instructor, TA, etc. However, do not simply copy and paste solutions. This is a chance for you to assess how much you have learned and determine if you are developing practical data visualization skills and knowledge.

Datasets

The datasets used for this dataset are stephen_curry_shotdata_2014_15.txt, ga_election_data.csv, ga_map.rda and the built in dataset palmerpenguins::penguins (need to install the palmerpenguins pacakge). Will also need the nbahalfcourt.jpg image.

Below you can find a short description of the variables contained in stephen_curry_shotdata_2014_15.txt:

  • GAME_ID - Unique ID for each game during the season
  • PLAYER_ID - Unique player ID
  • PLAYER_NAME - Player’s name
  • TEAM_ID - Unique team ID
  • TEAM_NAME - Team name
  • PERIOD - Quarter or period of the game
  • MINUTES_REMAINING - Minutes remaining in quarter/period
  • SECONDS_REMAINING - Seconds remaining in quarter/period
  • EVENT_TYPE - Missed Shot or Made Shot
  • SHOT_DISTANCE - Shot distance in feet
  • LOC_X - X location of shot attempt according to tracking system
  • LOC_Y - Y location of shot attempt according to tracking system

The ga_election_data.csv dataset contains the state of Georgia’s county level results for the 2020 US presidential election. Here is a short description of the variables it contains:

  • County - name of county in Georgia
  • Candidate - name of candidate on the ballot,
  • Election Day Votes - number of votes cast on election day for a candidate within a county
  • Absentee by Mail Votes - number of votes cast absentee by mail, pre-election day, for a candidate within a county
  • Advanced Voting Votes - number of votes cast in-person, pre-election day, for a candidate within a county
  • Provisional Votes - number of votes cast on election day for a candidate within a county needing voter eligibility verification
  • Total Votes - total number of votes for a candidate within a county

We have also included the map data for Georgia (ga_map.rda) which was retrieved using tigris::counties().

Run ?palmerpenguins::penguins in the console to get an overview of the penguins dataset.

Code
# load package(s)
library(tidyverse)
library(janitor)
library(patchwork)
library(palmerpenguins)
library(sf)
library(ggside)
library(ggthemes)

# load steph curry data
steph_curry <- read_delim(
  file = "data/stephen_curry_shotdata_2014_15.txt",
  delim = "|"
  ) %>%
  clean_names()
  
# load ga election & map data
ga_data <- read_csv("data/ga_election_data.csv") %>%
  clean_names()

load("data/ga_map.rda")
  
# load penguins dataset
 data("penguins")

Exercise 1

Using the stephen_curry_shotdata_2014_15.txt dataset replicate, as close as possible, the graphics below. After replicating the graphics provide a summary of what the graphics indicate about Stephen Curry’s shot selection (i.e. distance from hoop) and shot make/miss rate and how they relate/compare across distance and game time (i.e. across quarters/periods).

Plot 1

Hints:

  • Figure width 6 inches and height 4 inches, which is taken care of in code chunk yaml with fig-width and fig-height
  • Use minimal theme and adjust from there
  • While the plot needs to be very close to the one shown it does not need to be exact in terms of values. If you want to make it exact here are some useful values used, sometimes repeatedly, to make the plot: 12 & 14
Code
# data prep
steph_curry <- steph_curry %>%
  mutate(
    period = factor(
      period,
      levels = c(1, 2, 3, 4, 5),
      labels = c("Q1", "Q2", "Q3", "Q4", "OT")
      )
    )
Solution
Code
ggplot(steph_curry, aes(period, shot_distance)) +
  geom_boxplot(varwidth = TRUE) +
  facet_wrap(~event_type) +
  scale_x_discrete(
    name = "Quarter/Period",
  ) +
  scale_y_continuous(
    name = NULL,
    labels = scales::unit_format(unit = "ft"),
    limits = c(0, NA),
    breaks = seq(0, 40, 10),
    minor_breaks = NULL
  ) +
  labs(
    title = "Stephen Curry",
    subtitle = "2014-2015"
  )+
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    strip.text = element_text(face = "bold", size = 12),
    axis.title.x = element_text(face = "bold", size = 12),
    panel.grid.major.x = element_blank()
  )

Plot 2

Hints:

  • Figure width 6 inches and height 4 inches, which is taken care of in code chunk yaml with fig-width and fig-height
  • Use minimal theme and adjust from there
  • Useful hex colors: "#5D3A9B" and "#E66100"
  • No padding on vertical axis
  • Transparency is being used
  • annotate() is used to add labels
  • While the plot needs to be very close to the one shown it does not need to be exact in terms of values. If you want to make it exact here are some useful values used, sometimes repeatedly, to make the plot: 0, 0.04, 0.07, 0.081, 0.25, 3, 12, 14, 27
Solution
Code
ggplot(steph_curry, aes(shot_distance, fill = event_type, color = event_type)) +
  geom_density(alpha = 0.3) +
  scale_fill_manual(values = c("#5D3A9B", "#E66100")) +
  scale_color_manual(values = c("#5D3A9B", "#E66100"))+
  scale_x_continuous(
    name = NULL,
    breaks = seq(0, 40, 10),
    minor_breaks = NULL,
    labels = scales::unit_format(unit = "ft")
  ) +
  scale_y_continuous(
    name = NULL,
    breaks = NULL,
    limits = c(0, 0.081),
    expand = c(0, 0)
  ) +
  annotate(
    "text",
    x = 3,
    y = 0.04,
    label = "Made Shots",
    colour = "#5D3A9B",
    hjust = 0,
    vjust = 0
  ) +
  annotate(
    "text",
    x = 27,
    y = 0.07,
    label = "Missed Shots",
    colour = "#E66100",
    hjust = 0,
    vjust = 0
  ) +
  labs(
    title = "Stephen Curry",
    subtitle = "Shot Densities (2014-2015)"
  )+
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    strip.text = element_text(face = "bold", size = 12),
    axis.title.x = element_text(face = "bold", size = 12),
    legend.position = "none",
    panel.grid.major.x = element_blank()
  )

Plot 3

Hints:

  • Figure width 7 inches and height 7 inches, which is taken care of in code chunk yaml with fig-width and fig-height
  • Colors used: "grey", "red", "orange" "yellow" (don’t have to use "orange", you can get away with using only "red" and "yellow")
  • To top code so 15+ is the highest value, you need to set the limits in the appropriate scale while also also setting the na.value to the top color
  • While the plot needs to be very close to the one shown it does not need to be exact in terms of values. If you want to make it exact here are some useful values used, sometimes repeatedly, to make the plot: 0, 0.7, 5, 12, 14, 15, 20
Solution
Code
court <- grid::rasterGrob(
  jpeg::readJPEG(
    source = "data/nbahalfcourt.jpg"),
  width = unit(1, "npc"), 
  height = unit(1, "npc")
)

# plot
ggplot() +
  annotation_custom(
    grob = court,
    xmin = -250, xmax = 250,
    ymin = -52, ymax = 418
  ) +
  coord_fixed() +
  xlim(250, -250) +
  ylim(-52, 418) +
  geom_hex(
    data = steph_curry,
    aes(loc_x, loc_y),
    bins = 20,
    alpha = 0.7,
    color = "grey"
  ) +
  scale_fill_gradient(
    name = "Shot\nAttempts",
    low = "yellow",
    high = "red",
    breaks = c(0, 5, 10, 15),
    labels = c(0, 5, 10, "15+"),
    limits = c(0, 15),
    na.value = "red"
  ) +
  labs(
    subtitle = "Shot Chart (2014-2015)",
    title = "Stephen Curry"
  ) +
  theme_void() +
  theme(plot.title = element_text(face = "bold", size = 14))

Summary

Provide a summary of what the graphics above indicate about Stephen Curry’s shot selection (i.e. distance from hoop) and shot make/miss rate and how they relate/compare across distance and game time (i.e. across quarters/periods).

Solution

plot 1: For Stephen Curry, the distances for made shot tend to be smaller than missed shot from the plot. While for game time, there are really few shots in OT period for both missed and made, while for other 4 quarters, fewer in 2 and 4 and more in 1 and 3. Also, the distances from hoop in the OT period tend to be shorter than other periods. plot 2: Same as plot 1, the distances for made shot tend to be smaller than missed shot for Stephen Curry. Also, we can see that there are more made shots at the distance peak of 3ft, but there are more missed shits at the distance peak of 25ft. plot 3: Corresponding to plot 2, Stephen Curry made more shots in front of the basket, where he made more shots than missed shots. Also, he made many of his shots 25ft away from the baskets, where he missed more shots than made.

Exercise 2

Using the ga_election_data.csv dataset in conjunction with mapping data ga_map.rda replicate, as close as possible, the graphic below. Note the graphic is comprised of two plots displayed side-by-side. The plots both use the same shading scheme (i.e. scale limits and fill options).

Background Information: Holding the 2020 US Presidential election during the COVID-19 pandemic was a massive logistical undertaking. Voter engagement was extremely high which produced a historical high voting rate. Voting operations, headed by states, ran very monthly and encountered few COVID-19 related issues. The state of Georgia did a particularly good job at this by encouraging their residents to use early voting. About 75% of the vote in a typical county voted early! Ignoring county boundaries, about 4 in every 5 voters, 80%, in Georgia voted early.

While it is clear that early voting was the preferred option for Georgia voters, we want to investigate whether or not one candidate/party utilized early voting more than the other — we are focusing on the two major candidates. We created the graphic below to explore the relationship of voting mode and voter preference, which you are tasked with recreating.

After replicating the graphic provide a summary of how the two maps relate to one another. That is, what insight can we learn from the graphic.

Hints:

  • Figure width 7 inches and height 7 inches, which is taken care of in code chunk yaml with fig-width and fig-height
  • Make two plots, then arrange plots accordingly using patchwork package
  • patchwork::plot_annotation() will be useful for adding graphic title and caption; you’ll also set the theme options for the graphic title and caption (think font size and face)
  • ggthemes::theme_map() was used as the base theme for the plots
  • scale_*_gradient2() will be helpful
  • Useful hex colors: "#5D3A9B" and "#1AFF1A"
  • While the plot needs to be very close to the one shown it does not need to be exact in terms of values. If you want to make it exact here are some useful values used, sometimes repeatedly, to make the plot: 0.5, 0.75, 1, 10, 12, 14, 24
Important

Add comments to the code below where indicated. The added comments should concisely describe what the following line(s) of code do in the data wrangling process

Solution
Code
# data
ga_graph <- ga_data %>% 
  # create a new column called prop_pre_eday in the data frame and computes its values as the sum of two columns absentee_by_mail_votes and advanced_voting_votes divided by the total_votes column
  mutate(
    prop_pre_eday = (absentee_by_mail_votes + advanced_voting_votes) / total_votes
  ) %>% 
  # selects all columns except those that contain the string "_vote" in their column names
  select(-contains("_vote")) 

# biden map data
biden_map_data <- ga_map %>% 
  #  join two data frames based on "county" in the left data frame and "name" in the right data frame for information about Joseph R. Biden
  left_join(
    ga_graph %>% 
      filter(candidate == "Joseph R. Biden"),
    by = c("name" ="county")
  )

# trump map data
trump_map_data <- ga_map %>% 
  # merge these two data frames based on a common column name in the first data frame and county in the second data frame for information about Donald J. Trump
  left_join(
    ga_graph %>% 
      filter(candidate == "Donald J. Trump"),
    by = c("name" ="county")
  )

# biden plot
biden <- ggplot(biden_map_data) +
  geom_sf(aes(fill = prop_pre_eday), show.legend = FALSE) +
  scale_fill_gradient2(
    high = "#5D3A9B",
    low = "#1AFF1A",
    midpoint = 0.75,
    limits = c(0.5, 1),
  ) +
  theme_map() +
  ggtitle("Joseph R. Biden") +
  labs(subtitle = "Democratic Nominee") +
  theme(
    plot.title = element_text(face = "bold", size = 14)
  )

# trump plot
trump <- ggplot(trump_map_data) +
  geom_sf(aes(fill = prop_pre_eday)) +
  scale_fill_gradient2(
    name = NULL,
    high = "#5D3A9B",
    low = "#1AFF1A",
    midpoint = 0.75,
    limits = c(0.5, 1),
    breaks = c(0.5, 0.75, 1),
    labels = scales::percent_format()
  ) +
  theme_map() +
  labs(
    title = "Donald J. Trump",
    subtitle = "Republican Nominee") +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.justification = c(1, 1),
    legend.position = c(1, 1)
  )

# final plot
final <- biden + trump +
  plot_annotation(
    title = "Percentage of votes from early voting",
    caption = "Georgia:2021 US Presidents Election Results",
    theme = theme(
      plot.title = element_text(face = "bold", size = 24),
      plot.caption = element_text(size = 12)
    )
  )

final

Summary

Provide a summary of how the two maps relate to one another. That is, what insight can we learn from the graphic.

Solution

Considering only votes, these two maps above shows two opposite information. If one votes for Joseph R. Biden, then there will be no vote for Donald J. Trump, so these two are corresponding plots.

One thing that makes using ggplot2 in R so powerful is the ever expanding set of extension packages. At this point, students should have the skills and know how to be able to make use of extension packages. The focus of this exercise is for students to demonstrate their ability to make use of a ggplot2 extension package.

For this exercise you will need the penguins dataset from the palmerpenguins package. The extension package is ggside. A little internet searching and you should be able to find the documentation for ggside and its GitHub` repo with very useful examples.

Recreate the graphic below as closely as possible

Hints:

  • alpha values: 0.6 and 0.3
  • scale for the side panels: 0.3
  • should be able to track down an example that is very similar to this
Solution
Code
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, fill = species)) +
  geom_point(aes(color = species), alpha = 0.6) +
  geom_xsidedensity(alpha = .3, position = "stack") +
  geom_ysideboxplot(aes(x = species), orientation = "x") +
  scale_ysidex_discrete(guide = guide_axis(angle = 45)) +
  labs(x = "Flipper Length (mm)", y = "Body Mass (grams)") +
  theme_minimal() +
  theme(ggside.panel.scale = .3,
        legend.position = "none")